{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "klGNgWREsvQv"
      },
      "source": [
        "**Copyright 2018 The TF-Agents Authors.**"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "nQnmcm0oI1Q-"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "lsaQlK8fFQqH"
      },
      "source": [
        "# SAC minitaur\n",
        "\n",
        "\u003ctable class=\"tfo-notebook-buttons\" align=\"left\"\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://www.tensorflow.org/agents/tutorials/7_SAC_minitaur_tutorial\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" /\u003e\n",
        "    View on TensorFlow.org\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/agents/blob/master/docs/tutorials/7_SAC_minitaur_tutorial.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" /\u003e\n",
        "    Run in Google Colab\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca target=\"_blank\" href=\"https://github.com/tensorflow/agents/blob/master/docs/tutorials/7_SAC_minitaur_tutorial.ipynb\"\u003e\n",
        "    \u003cimg src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" /\u003e\n",
        "    View source on GitHub\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "  \u003ctd\u003e\n",
        "    \u003ca href=\"https://storage.googleapis.com/tensorflow_docs/agents/docs/tutorials/7_SAC_minitaur_tutorial.ipynb\"\u003e\u003cimg src=\"https://www.tensorflow.org/images/download_logo_32px.png\" /\u003eDownload notebook\u003c/a\u003e\n",
        "  \u003c/td\u003e\n",
        "\u003c/table\u003e\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZOUOQOrFs3zn"
      },
      "source": [
        "## Introduction"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cKOCZlhUgXVK"
      },
      "source": [
        "This example shows how to train a [Soft Actor Critic](https://arxiv.org/abs/1801.01290)  agent on the [Minitaur](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/minitaur.py) environment using the TF-Agents library.\n",
        "\n",
        "If you've worked through the [DQN Colab](https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb) this should feel very familiar. Notable changes include:\n",
        "\n",
        "  * Changing the agent from DQN to SAC.\n",
        "  * Training on Minitaur which is a much more complex environment than CartPole. The Minitaur environment aims to train a quadruped robot to move forward.\n",
        "  * We do not use a random policy to perform an initial data collection.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "cuztVoQgiaR9"
      },
      "source": [
        "If you haven't installed the following dependencies, run:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "KEHR2Ui-lo8O"
      },
      "outputs": [],
      "source": [
        "!sudo apt-get install -y xvfb ffmpeg\n",
        "!pip install 'gym==0.10.11'\n",
        "!pip install 'imageio==2.4.0'\n",
        "!pip install matplotlib\n",
        "!pip install PILLOW\n",
        "!pip install tf-agents\n",
        "!pip install 'pybullet==2.4.2'"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "1u9QVVsShC9X"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "nNV5wyH3dyMl"
      },
      "source": [
        "First we will import the different tools that we need and make sure we enable TF-V2 behavior as it is easier to iterate in Eager mode throughout the colab."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "sMitx5qSgJk1"
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import\n",
        "from __future__ import division\n",
        "from __future__ import print_function\n",
        "\n",
        "import base64\n",
        "import imageio\n",
        "import IPython\n",
        "import matplotlib\n",
        "import matplotlib.pyplot as plt\n",
        "import PIL.Image\n",
        "\n",
        "import tensorflow as tf\n",
        "tf.compat.v1.enable_v2_behavior()\n",
        "\n",
        "from tf_agents.agents.ddpg import critic_network\n",
        "from tf_agents.agents.sac import sac_agent\n",
        "from tf_agents.drivers import dynamic_step_driver\n",
        "from tf_agents.environments import suite_pybullet\n",
        "from tf_agents.environments import tf_py_environment\n",
        "from tf_agents.eval import metric_utils\n",
        "from tf_agents.metrics import tf_metrics\n",
        "from tf_agents.networks import actor_distribution_network\n",
        "from tf_agents.networks import normal_projection_network\n",
        "from tf_agents.policies import greedy_policy\n",
        "from tf_agents.policies import random_tf_policy\n",
        "from tf_agents.replay_buffers import tf_uniform_replay_buffer\n",
        "from tf_agents.trajectories import trajectory\n",
        "from tf_agents.utils import common\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "LmC0NDhdLIKY"
      },
      "source": [
        "## Hyperparameters"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "HC1kNrOsLSIZ"
      },
      "outputs": [],
      "source": [
        "env_name = \"MinitaurBulletEnv-v0\" # @param {type:\"string\"}\n",
        "\n",
        "# use \"num_iterations = 1e6\" for better results,\n",
        "# 1e5 is just so this doesn't take too long. \n",
        "num_iterations = 100000 # @param {type:\"integer\"}\n",
        "\n",
        "initial_collect_steps = 10000 # @param {type:\"integer\"} \n",
        "collect_steps_per_iteration = 1 # @param {type:\"integer\"}\n",
        "replay_buffer_capacity = 1000000 # @param {type:\"integer\"}\n",
        "\n",
        "batch_size = 256 # @param {type:\"integer\"}\n",
        "\n",
        "critic_learning_rate = 3e-4 # @param {type:\"number\"}\n",
        "actor_learning_rate = 3e-4 # @param {type:\"number\"}\n",
        "alpha_learning_rate = 3e-4 # @param {type:\"number\"}\n",
        "target_update_tau = 0.005 # @param {type:\"number\"}\n",
        "target_update_period = 1 # @param {type:\"number\"}\n",
        "gamma = 0.99 # @param {type:\"number\"}\n",
        "reward_scale_factor = 1.0 # @param {type:\"number\"}\n",
        "gradient_clipping = None # @param\n",
        "\n",
        "actor_fc_layer_params = (256, 256)\n",
        "critic_joint_fc_layer_params = (256, 256)\n",
        "\n",
        "log_interval = 5000 # @param {type:\"integer\"}\n",
        "\n",
        "num_eval_episodes = 30 # @param {type:\"integer\"}\n",
        "eval_interval = 10000 # @param {type:\"integer\"}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "VMsJC3DEgI0x"
      },
      "source": [
        "## Environment\n",
        "\n",
        "Environments in RL represent the task or problem that we are trying to solve. Standard environments can be easily created in TF-Agents using `suites`. We have different `suites` for loading environments from sources such as the OpenAI Gym, Atari, DM Control, etc., given a string environment name.\n",
        "\n",
        "Now let's load the Minituar environment from the Pybullet suite."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "RlO7WIQHu_7D"
      },
      "outputs": [],
      "source": [
        "env = suite_pybullet.load(env_name)\n",
        "env.reset()\n",
        "PIL.Image.fromarray(env.render())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "gY179d1xlmoM"
      },
      "source": [
        "In this environment the goal is for the agent to train a policy that will control the Minitaur robot and have it move forward as fast as possible. Episodes last 1000 steps and the return will be the sum of rewards throughout the episode. \n",
        "\n",
        "Let's look at the information the environment provides as an `observation` which the policy will use to generate `actions`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "exDv57iHfwQV"
      },
      "outputs": [],
      "source": [
        "print('Observation Spec:')\n",
        "print(env.time_step_spec().observation)\n",
        "print('Action Spec:')\n",
        "print(env.action_spec())"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "zuUqXAVmecTU"
      },
      "source": [
        "As we can see the observation is fairly complex. We recieve 28 values representing the angles, velocities and torques for all the motors. In return the environment expects 8 values for the actions between `[-1, 1]`. These are the desired motor angles.\n",
        "\n",
        "\n",
        "Usually we create two environments: one for training and one for evaluation. Most environments are written in pure python, but they can be easily converted to TensorFlow using the `TFPyEnvironment` wrapper. The original environment's API uses numpy arrays, the `TFPyEnvironment` converts these to/from `Tensors` for you to more easily interact with TensorFlow policies and agents.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Xp-Y4mD6eDhF"
      },
      "outputs": [],
      "source": [
        "train_py_env = suite_pybullet.load(env_name)\n",
        "eval_py_env = suite_pybullet.load(env_name)\n",
        "\n",
        "train_env = tf_py_environment.TFPyEnvironment(train_py_env)\n",
        "eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "E9lW_OZYFR8A"
      },
      "source": [
        "## Agent\n",
        "\n",
        "To create an SAC Agent, we first need to create the networks that it will train. SAC is an actor-critic agent, so we will need two networks.\n",
        "\n",
        "The critic will give us value estimates for `Q(s,a)`. That is, it will recieve as input an observation and an action, and it will give us an estimate of how good that action was for the given state.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "TgkdEPg_muzV"
      },
      "outputs": [],
      "source": [
        "observation_spec = train_env.observation_spec()\n",
        "action_spec = train_env.action_spec()\n",
        "critic_net = critic_network.CriticNetwork(\n",
        "    (observation_spec, action_spec),\n",
        "    observation_fc_layer_params=None,\n",
        "    action_fc_layer_params=None,\n",
        "    joint_fc_layer_params=critic_joint_fc_layer_params)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "pYy4AH4V7Ph4"
      },
      "source": [
        "We will use this critic to train an `actor` network which will allow us to generate actions given an observation.\n",
        "\n",
        "The `ActorNetwork` will predict parameters for a Normal distribution. This distribution will then be sampled, conditioned on the current observation,\n",
        "whenever we need to generate actions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "TB5Y3Oub4u7f"
      },
      "outputs": [],
      "source": [
        "def normal_projection_net(action_spec,init_means_output_factor=0.1):\n",
        "  return normal_projection_network.NormalProjectionNetwork(\n",
        "      action_spec,\n",
        "      mean_transform=None,\n",
        "      state_dependent_std=True,\n",
        "      init_means_output_factor=init_means_output_factor,\n",
        "      std_transform=sac_agent.std_clip_transform,\n",
        "      scale_distribution=True)\n",
        "\n",
        "\n",
        "actor_net = actor_distribution_network.ActorDistributionNetwork(\n",
        "    observation_spec,\n",
        "    action_spec,\n",
        "    fc_layer_params=actor_fc_layer_params,\n",
        "    continuous_projection_net=normal_projection_net)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "z62u55hSmviJ"
      },
      "source": [
        "With these networks at hand we can now instantiate the agent.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "jbY4yrjTEyc9"
      },
      "outputs": [],
      "source": [
        "global_step = tf.compat.v1.train.get_or_create_global_step()\n",
        "tf_agent = sac_agent.SacAgent(\n",
        "    train_env.time_step_spec(),\n",
        "    action_spec,\n",
        "    actor_network=actor_net,\n",
        "    critic_network=critic_net,\n",
        "    actor_optimizer=tf.compat.v1.train.AdamOptimizer(\n",
        "        learning_rate=actor_learning_rate),\n",
        "    critic_optimizer=tf.compat.v1.train.AdamOptimizer(\n",
        "        learning_rate=critic_learning_rate),\n",
        "    alpha_optimizer=tf.compat.v1.train.AdamOptimizer(\n",
        "        learning_rate=alpha_learning_rate),\n",
        "    target_update_tau=target_update_tau,\n",
        "    target_update_period=target_update_period,\n",
        "    td_errors_loss_fn=tf.compat.v1.losses.mean_squared_error,\n",
        "    gamma=gamma,\n",
        "    reward_scale_factor=reward_scale_factor,\n",
        "    gradient_clipping=gradient_clipping,\n",
        "    train_step_counter=global_step)\n",
        "tf_agent.initialize()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "I0KLrEPwkn5x"
      },
      "source": [
        "## Policies\n",
        "\n",
        "In TF-Agents, policies represent the standard notion of policies in RL: given a `time_step` produce an action or a distribution over actions. The main method is `policy_step = policy.step(time_step)` where `policy_step` is a named tuple `PolicyStep(action, state, info)`.  The `policy_step.action` is the `action` to be applied to the environment, `state` represents the state for stateful (RNN) policies and `info` may contain auxiliary information such as log probabilities of the actions.\n",
        "\n",
        "Agents contain two policies: the main policy (agent.policy) and the behavioral policy that is used for data collection (agent.collect_policy). For evaluation/deployment, we take the mean action by wrapping the main policy with GreedyPolicy()."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "BwY7StuMkuV4"
      },
      "outputs": [],
      "source": [
        "eval_policy = greedy_policy.GreedyPolicy(tf_agent.policy)\n",
        "collect_policy = tf_agent.collect_policy"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "94rCXQtbUbXv"
      },
      "source": [
        "## Metrics and Evaluation\n",
        "\n",
        "The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for an episode, and we usually average this over a few episodes. We can compute the average return metric as follows.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "bitzHo5_UbXy"
      },
      "outputs": [],
      "source": [
        "def compute_avg_return(environment, policy, num_episodes=5):\n",
        "\n",
        "  total_return = 0.0\n",
        "  for _ in range(num_episodes):\n",
        "\n",
        "    time_step = environment.reset()\n",
        "    episode_return = 0.0\n",
        "\n",
        "    while not time_step.is_last():\n",
        "      action_step = policy.action(time_step)\n",
        "      time_step = environment.step(action_step.action)\n",
        "      episode_return += time_step.reward\n",
        "    total_return += episode_return\n",
        "\n",
        "  avg_return = total_return / num_episodes\n",
        "  return avg_return.numpy()[0]\n",
        "\n",
        "\n",
        "compute_avg_return(eval_env, eval_policy, num_eval_episodes)\n",
        "\n",
        "# Please also see the metrics module for standard implementations of different\n",
        "# metrics."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "NLva6g2jdWgr"
      },
      "source": [
        "## Replay Buffer\n",
        "\n",
        "In order to keep track of the data collected from the environment, we will use the TFUniformReplayBuffer. This replay buffer is constructed using specs describing the tensors that are to be stored, which can be obtained from the agent using `tf_agent.collect_data_spec`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "vX2zGUWJGWAl"
      },
      "outputs": [],
      "source": [
        "replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(\n",
        "    data_spec=tf_agent.collect_data_spec,\n",
        "    batch_size=train_env.batch_size,\n",
        "    max_length=replay_buffer_capacity)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ZGNTDJpZs4NN"
      },
      "source": [
        "For most agents, the `collect_data_spec` is a `Trajectory` named tuple containing the observation, action, reward etc."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rVD5nQ9ZGo8_"
      },
      "source": [
        "## Data Collection\n",
        "\n",
        "Now we will create a driver to collect experience to seed the replay buffer with. Drivers provide us with a simple way to collecter `n` steps or episodes on an environment using a specific policy."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "wr1KSAEGG4h9"
      },
      "outputs": [],
      "source": [
        "initial_collect_driver = dynamic_step_driver.DynamicStepDriver(\n",
        "        train_env,\n",
        "        collect_policy,\n",
        "        observers=[replay_buffer.add_batch],\n",
        "        num_steps=initial_collect_steps)\n",
        "initial_collect_driver.run()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "TujU-PMUsKjS"
      },
      "source": [
        "In order to sample data from the replay buffer, we will create a `tf.data` pipeline which we can feed to the agent for training later. We can specify the `sample_batch_size` to configure the number of items sampled from the replay buffer. We can also optimize the data pipeline using parallel calls and prefetching.\n",
        "\n",
        "In order to save space, we only store the current observation in each row of the replay buffer. But since the SAC Agent needs both the current and next observation to compute the loss, we always sample two adjacent rows for each item in the batch by setting `num_steps=2`."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "ba7bilizt_qW"
      },
      "outputs": [],
      "source": [
        "# Dataset generates trajectories with shape [Bx2x...]\n",
        "dataset = replay_buffer.as_dataset(\n",
        "    num_parallel_calls=3, sample_batch_size=batch_size, num_steps=2).prefetch(3)\n",
        "\n",
        "iterator = iter(dataset)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "hBc9lj9VWWtZ"
      },
      "source": [
        "## Training the agent\n",
        "\n",
        "The training loop involves both collecting data from the environment and optimizing the agent's networks. Along the way, we will occasionally evaluate the agent's policy to see how we are doing."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "65ZgD0V1Cs2t"
      },
      "outputs": [],
      "source": [
        "collect_driver = dynamic_step_driver.DynamicStepDriver(\n",
        "    train_env,\n",
        "    collect_policy,\n",
        "    observers=[replay_buffer.add_batch],\n",
        "    num_steps=collect_steps_per_iteration)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "0pTbJ3PeyF-u"
      },
      "outputs": [],
      "source": [
        "#@test {\"skip\": true}\n",
        "try:\n",
        "  %%time\n",
        "except:\n",
        "  pass\n",
        "\n",
        "# (Optional) Optimize by wrapping some of the code in a graph using TF function.\n",
        "tf_agent.train = common.function(tf_agent.train)\n",
        "collect_driver.run = common.function(collect_driver.run)\n",
        "\n",
        "# Reset the train step\n",
        "tf_agent.train_step_counter.assign(0)\n",
        "\n",
        "# Evaluate the agent's policy once before training.\n",
        "avg_return = compute_avg_return(eval_env, eval_policy, num_eval_episodes)\n",
        "returns = [avg_return]\n",
        "\n",
        "for _ in range(num_iterations):\n",
        "\n",
        "  # Collect a few steps using collect_policy and save to the replay buffer.\n",
        "  for _ in range(collect_steps_per_iteration):\n",
        "    collect_driver.run()\n",
        "\n",
        "  # Sample a batch of data from the buffer and update the agent's network.\n",
        "  experience, unused_info = next(iterator)\n",
        "  train_loss = tf_agent.train(experience)\n",
        "\n",
        "  step = tf_agent.train_step_counter.numpy()\n",
        "\n",
        "  if step % log_interval == 0:\n",
        "    print('step = {0}: loss = {1}'.format(step, train_loss.loss))\n",
        "\n",
        "  if step % eval_interval == 0:\n",
        "    avg_return = compute_avg_return(eval_env, eval_policy, num_eval_episodes)\n",
        "    print('step = {0}: Average Return = {1}'.format(step, avg_return))\n",
        "    returns.append(avg_return)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "68jNcA_TiJDq"
      },
      "source": [
        "## Visualization\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "aO-LWCdbbOIC"
      },
      "source": [
        "### Plots\n",
        "\n",
        "We can plot average return vs global steps to see the performance of our agent. In `Minitaur`, the reward function is based on how far the minitaur walks in 1000 steps and penalizes the energy expenditure."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "NxtL1mbOYCVO"
      },
      "outputs": [],
      "source": [
        "#@test {\"skip\": true}\n",
        "\n",
        "steps = range(0, num_iterations + 1, eval_interval)\n",
        "plt.plot(steps, returns)\n",
        "plt.ylabel('Average Return')\n",
        "plt.xlabel('Step')\n",
        "plt.ylim()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "M7-XpPP99Cy7"
      },
      "source": [
        "### Videos"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9pGfGxSH32gn"
      },
      "source": [
        "It is helpful to visualize the performance of an agent by rendering the environment at each step. Before we do that, let us first create a function to embed videos in this colab."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "ULaGr8pvOKbl"
      },
      "outputs": [],
      "source": [
        "def embed_mp4(filename):\n",
        "  \"\"\"Embeds an mp4 file in the notebook.\"\"\"\n",
        "  video = open(filename,'rb').read()\n",
        "  b64 = base64.b64encode(video)\n",
        "  tag = '''\n",
        "  \u003cvideo width=\"640\" height=\"480\" controls\u003e\n",
        "    \u003csource src=\"data:video/mp4;base64,{0}\" type=\"video/mp4\"\u003e\n",
        "  Your browser does not support the video tag.\n",
        "  \u003c/video\u003e'''.format(b64.decode())\n",
        "\n",
        "  return IPython.display.HTML(tag)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9c_PH-pX4Pr5"
      },
      "source": [
        "The following code visualizes the agent's policy for a few episodes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "owOVWB158NlF"
      },
      "outputs": [],
      "source": [
        "num_episodes = 3\n",
        "video_filename = 'sac_minitaur.mp4'\n",
        "with imageio.get_writer(video_filename, fps=60) as video:\n",
        "  for _ in range(num_episodes):\n",
        "    time_step = eval_env.reset()\n",
        "    video.append_data(eval_py_env.render())\n",
        "    while not time_step.is_last():\n",
        "      action_step = tf_agent.policy.action(time_step)\n",
        "      time_step = eval_env.step(action_step.action)\n",
        "      video.append_data(eval_py_env.render())\n",
        "\n",
        "embed_mp4(video_filename)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "7_SAC_minitaur_tutorial.ipynb",
      "private_outputs": true,
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "pycharm": {
      "stem_cell": {
        "cell_type": "raw",
        "metadata": {
          "collapsed": false
        },
        "source": []
      }
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
